{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# HF-Upload\n",
    "\n",
    "- Author: [Sun Hyoung Lee](https://github.com/LEE1026icarus)\n",
    "- Design: \n",
    "- Peer Review : \n",
    "- Proofread:\n",
    "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
    "\n",
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/08-Embedding/04-UpstageEmbeddings.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/08-Embedding/04-UpstageEmbeddings.ipynb)\n",
    "\n",
    "## Overview\n",
    "\n",
    "The process involves loading a local CSV file, converting it to a HuggingFace Dataset format, and uploading it to the Hugging Face Hub as a private dataset. This process allows for easy sharing and access of the dataset through the HuggingFace infrastructure.\n",
    "\n",
    "### Table of Contents\n",
    "\n",
    "- [Overview](#overview)\n",
    "- [Environement Setup](#environment-setup)\n",
    "- [Upload Generated Dataset](#upload-generated-dataset)\n",
    "- [Upload to HuggingFace Dataset](#upload-to-huggingface-dataset)\n",
    "\n",
    "\n",
    "### References\n",
    "- [Huggingface / Share a dataset to the Hub](https://huggingface.co/docs/datasets/upload_dataset)\n",
    "---\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Environment Setup\n",
    "\n",
    "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n",
    "\n",
    " **[Note]** \n",
    "- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n",
    "- You can checkout the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.\n",
    "\n",
    "### API Key Configuration\n",
    "To use `HuggingFace Dataset` , you need to [obtain a HuggingFace write token](https://huggingface.co/settings/tokens).\n",
    "\n",
    "Once you have your API key, set it as the value for the variable `HUGGINGFACEHUB_API_TOKEN` ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "[notice] A new release of pip is available: 24.3.1 -> 25.0.1\n",
      "[notice] To update, run: python.exe -m pip install --upgrade pip\n"
     ]
    }
   ],
   "source": [
    "%%capture --no-stderr\n",
    "%pip install langchain-opentutorial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install required packages\n",
    "from langchain_opentutorial import package\n",
    "\n",
    "package.install(\n",
    "    [\"datasets\"],\n",
    "    verbose=False,\n",
    "    upgrade=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can set API keys in a `.env` file or set them manually.\n",
    "\n",
    "[Note] If you’re not using the `.env` file, no worries! Just enter the keys directly in the cell below, and you’re good to go.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dotenv import load_dotenv\n",
    "from langchain_opentutorial import set_env\n",
    "\n",
    "# Attempt to load environment variables from a .env file; if unsuccessful, set them manually.\n",
    "if not load_dotenv(override=True):\n",
    "    set_env(\n",
    "        {\n",
    "            \"LANGCHAIN_API_KEY\": \"\",\n",
    "            \"LANGCHAIN_TRACING_V2\": \"true\",\n",
    "            \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n",
    "            \"LANGCHAIN_PROJECT\": \"\", # set the project name same as the title\n",
    "            \"HUGGINGFACEHUB_API_TOKEN\": \"\",\n",
    "        }\n",
    "    )\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from dotenv import load_dotenv\n",
    "\n",
    "load_dotenv(override=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Upload Generated Dataset\n",
    "Import the **pandas library** for data upload"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_input</th>\n",
       "      <th>reference_contexts</th>\n",
       "      <th>reference</th>\n",
       "      <th>synthesizer_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Wht is an API?</td>\n",
       "      <td>[\"Agents\\nThis combination of reasoning,\\nlogi...</td>\n",
       "      <td>An API can be used by a model to make various ...</td>\n",
       "      <td>single_hop_specifc_query_synthesizer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>What are the three essential components in an ...</td>\n",
       "      <td>['Agents\\nWhat is an agent?\\nIn its most funda...</td>\n",
       "      <td>The three essential components in an agent's c...</td>\n",
       "      <td>single_hop_specifc_query_synthesizer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>What Chain-of-Thought do in agent model, how i...</td>\n",
       "      <td>['Agents\\nFigure 1. General agent architecture...</td>\n",
       "      <td>Chain-of-Thought is a reasoning and logic fram...</td>\n",
       "      <td>single_hop_specifc_query_synthesizer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Waht is the DELETE method used for?</td>\n",
       "      <td>['Agents\\nThe tools\\nFoundational models, desp...</td>\n",
       "      <td>The DELETE method is a common web API method t...</td>\n",
       "      <td>single_hop_specifc_query_synthesizer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>How do foundational components contribute to t...</td>\n",
       "      <td>['&lt;1-hop&gt;\\n\\nAgents\\ncombining specialized age...</td>\n",
       "      <td>Foundational components contribute to the cogn...</td>\n",
       "      <td>NewMultiHopQuery</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                          user_input  \\\n",
       "0                                     Wht is an API?   \n",
       "1  What are the three essential components in an ...   \n",
       "2  What Chain-of-Thought do in agent model, how i...   \n",
       "3                Waht is the DELETE method used for?   \n",
       "4  How do foundational components contribute to t...   \n",
       "\n",
       "                                  reference_contexts  \\\n",
       "0  [\"Agents\\nThis combination of reasoning,\\nlogi...   \n",
       "1  ['Agents\\nWhat is an agent?\\nIn its most funda...   \n",
       "2  ['Agents\\nFigure 1. General agent architecture...   \n",
       "3  ['Agents\\nThe tools\\nFoundational models, desp...   \n",
       "4  ['<1-hop>\\n\\nAgents\\ncombining specialized age...   \n",
       "\n",
       "                                           reference  \\\n",
       "0  An API can be used by a model to make various ...   \n",
       "1  The three essential components in an agent's c...   \n",
       "2  Chain-of-Thought is a reasoning and logic fram...   \n",
       "3  The DELETE method is a common web API method t...   \n",
       "4  Foundational components contribute to the cogn...   \n",
       "\n",
       "                       synthesizer_name  \n",
       "0  single_hop_specifc_query_synthesizer  \n",
       "1  single_hop_specifc_query_synthesizer  \n",
       "2  single_hop_specifc_query_synthesizer  \n",
       "3  single_hop_specifc_query_synthesizer  \n",
       "4                      NewMultiHopQuery  "
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(\"data/ragas_synthetic_dataset.csv\")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Upload to HuggingFace Dataset\n",
    "Convert a **Pandas DataFrame(`df`)** to a Hugging Face Dataset and proceed with the upload."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset({\n",
      "    features: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name'],\n",
      "    num_rows: 10\n",
      "})\n"
     ]
    }
   ],
   "source": [
    "from datasets import Dataset\n",
    "\n",
    "# Convert pandas DataFrame to Hugging Face Dataset\n",
    "dataset = Dataset.from_pandas(df)\n",
    "\n",
    "# Check the dataset\n",
    "print(dataset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "40ee3ff14ca44bbbbbcd9f17a12df54a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "No files have been modified since last commit. Skipping to prevent empty commit.\n"
     ]
    }
   ],
   "source": [
    "from datasets import Dataset\n",
    "import os\n",
    "\n",
    "# Convert pandas DataFrame to Hugging Face Dataset\n",
    "dataset = Dataset.from_pandas(df)\n",
    "\n",
    "# Set dataset name (change to your desired name)\n",
    "hf_username = \"icarus1026\"  # Your Hugging Face Username(ID)\n",
    "dataset_name = f\"{hf_username}/rag-synthetic-dataset\"\n",
    "\n",
    "# Upload dataset\n",
    "dataset.push_to_hub(\n",
    "    dataset_name,\n",
    "    private=True,  # Set private=False for a public dataset\n",
    "    split=\"test_v1\",  # Enter dataset split name\n",
    "    token=os.getenv(\"HUGGINGFACEHUB_API_TOKEN\"),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Note] The Dataset Viewer may take some time to display.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "langchain-kr-bpXWMSjn-py3.11",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
